{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WDbhGHtG8jFE"
      },
      "source": [
        "# Build an Embeddings index from a data source\n",
        "\n",
        "In Part 1, we gave a general overview of txtai, the backing technology and examples of how to use it for similarity searches. Part 2 covered an embedding index with a larger dataset.\n",
        "\n",
        "For real world large-scale use cases, data is often stored in a database (Elasticsearch, SQL, MongoDB, files, etc). Here we'll show how to read from SQLite, build a Embedding index backed by word embeddings and run queries against the generated Embeddings index.\n",
        "\n",
        "This example covers functionality found in the [paperai](https://github.com/neuml/paperai) library. See that library for a full solution that can be used with the dataset discussed below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UQ0fCwXn9bcH"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies. Since this notebook is building word vectors, we need to install the similarity extras package."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "czPYSA2Q9ZHO"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[similarity]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SN9SCZKQ9fJF"
      },
      "source": [
        "# Download data\n",
        "\n",
        "This example is going to work off a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.\n",
        "\n",
        "The following download is a SQLite database generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format, can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TONQ4_Kv9dtd"
      },
      "source": [
        "%%capture\n",
        "!wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz\n",
        "!gunzip tests.gz\n",
        "!mv tests articles.sqlite"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bzdaJiZYBIHE"
      },
      "source": [
        "# Build Word Vectors\n",
        "\n",
        "This example will build a search system backed by word embeddings. Word embeddings provide a good tradeoff of performance to functionality for a similarity search system. With that being said, transformer models are making great progress in scaling performance down to smaller models and are the preferred vector backend for txtai in most cases.\n",
        "\n",
        "For this notebook, we'll build our own custom embeddings for demo purposes. A number of pre-trained word embedding models are available:\n",
        "\n",
        " - [General language models from pymagnitude](https://github.com/plasticityai/magnitude)\n",
        " - [CORD-19 fastText](https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fJcn-CAH-u3K",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "ab5a4340-13fc-4b47-b881-9388e2534ece"
      },
      "source": [
        "import os\n",
        "import sqlite3\n",
        "import tempfile\n",
        "\n",
        "from txtai.pipeline import Tokenizer\n",
        "from txtai.vectors import WordVectors\n",
        "\n",
        "print(\"Streaming tokens to temporary file\")\n",
        "\n",
        "# Stream tokens to temp working file\n",
        "with tempfile.NamedTemporaryFile(mode=\"w\", suffix=\".txt\", delete=False) as output:\n",
        "  # Save file path\n",
        "  tokens = output.name\n",
        "\n",
        "  db = sqlite3.connect(\"articles.sqlite\")\n",
        "  cur = db.cursor()\n",
        "  cur.execute(\"SELECT Text from sections\")\n",
        "\n",
        "  for row in cur:\n",
        "    output.write(\" \".join(Tokenizer.tokenize(row[0])) + \"\\n\")\n",
        "\n",
        "  # Free database resources\n",
        "  db.close()\n",
        "\n",
        "# Build word vectors model - 300 dimensions, 3 min occurrences\n",
        "WordVectors.build(tokens, 300, 3, \"cord19-300d\")\n",
        "\n",
        "# Remove temporary tokens file\n",
        "os.remove(tokens)\n",
        "\n",
        "# Show files\n",
        "!ls -l"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Streaming tokens to temporary file\n",
            "Building 300 dimension model\n",
            "Converting vectors to magnitude format\n",
            "total 78948\n",
            "-rw-r--r-- 1 root root  8065024 Dec  7 15:48 articles.sqlite\n",
            "-rw-r--r-- 1 root root 24145920 Dec 27 17:42 cord19-300d.magnitude\n",
            "-rw-r--r-- 1 root root 48625387 Dec 27 17:41 cord19-300d.txt\n",
            "drwxr-xr-x 1 root root     4096 Dec  3 14:33 sample_data\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_UxcC1-JGH-d"
      },
      "source": [
        "# Build an embeddings index\n",
        "\n",
        "The following steps build an embeddings index using the word vector model just created. This model builds a BM25 + fastText index. BM25 is used to build a weighted average of the word embeddings for a section. More information on this method can be found in this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240?gi=79da927aa10)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5PrrxGRPGHqX",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "61bf7211-6757-4147-8f2f-e4d1ebe58e11"
      },
      "source": [
        "import sqlite3\n",
        "\n",
        "import regex as re\n",
        "\n",
        "from txtai.embeddings import Embeddings\n",
        "from txtai.pipeline import Tokenizer\n",
        "\n",
        "def stream():\n",
        "  # Connection to database file\n",
        "  db = sqlite3.connect(\"articles.sqlite\")\n",
        "  cur = db.cursor()\n",
        "\n",
        "  # Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.\n",
        "  cur.execute(\"SELECT Id, Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND tags is not null\")\n",
        "\n",
        "  count = 0\n",
        "  for row in cur:\n",
        "    # Unpack row\n",
        "    uid, name, text = row\n",
        "\n",
        "    # Only process certain document sections\n",
        "    if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
        "      # Tokenize text\n",
        "      tokens = Tokenizer.tokenize(text)\n",
        "\n",
        "      document = (uid, tokens, None)\n",
        "\n",
        "      count += 1\n",
        "      if count % 1000 == 0:\n",
        "        print(\"Streamed %d documents\" % (count), end=\"\\r\")\n",
        "\n",
        "      # Skip documents with no tokens parsed\n",
        "      if tokens:\n",
        "        yield document\n",
        "\n",
        "  print(\"Iterated over %d total rows\" % (count))\n",
        "\n",
        "  # Free database resources\n",
        "  db.close()\n",
        "\n",
        "# BM25 + fastText vectors\n",
        "embeddings = Embeddings({\"path\": \"cord19-300d.magnitude\",\n",
        "                         \"scoring\": \"bm25\",\n",
        "                         \"pca\": 3})\n",
        "\n",
        "# Build scoring index if scoring method provided\n",
        "if embeddings.config.get(\"scoring\"):\n",
        "  embeddings.score(stream())\n",
        "\n",
        "# Build embeddings index\n",
        "embeddings.index(stream())\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Iterated over 21499 total rows\n",
            "Iterated over 21499 total rows\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zHk24su3e_gb"
      },
      "source": [
        "# Query data\n",
        "\n",
        "The following runs a query against the embeddings index for the terms \"risk factors\". It finds the top 5 matches and returns the corresponding documents associated with each match."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CRbDhvvDKEl-",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 293
        },
        "outputId": "774f8085-01db-49e2-f025-4fe68afca594"
      },
      "source": [
        "import pandas as pd\n",
        "\n",
        "from IPython.display import display, HTML\n",
        "\n",
        "pd.set_option(\"display.max_colwidth\", None)\n",
        "\n",
        "db = sqlite3.connect(\"articles.sqlite\")\n",
        "cur = db.cursor()\n",
        "\n",
        "results = []\n",
        "for uid, score in embeddings.search(\"risk factors\", 5):\n",
        "  cur.execute(\"SELECT article, text FROM sections WHERE id = ?\", [uid])\n",
        "  uid, text = cur.fetchone()\n",
        "\n",
        "  cur.execute(\"SELECT Title, Published, Reference from articles where id = ?\", [uid])\n",
        "  results.append(cur.fetchone() + (text,))\n",
        "\n",
        "# Free database resources\n",
        "db.close()\n",
        "\n",
        "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\"])\n",
        "\n",
        "# It has been reported that displaying HTML within VSCode doesn't work.\n",
        "# When using VSCode, the data can be exported to an external HTML file to view.\n",
        "# See example below.\n",
        "\n",
        "# htmlData = df.to_html(index=False)\n",
        "# with open(\"data.html\", \"w\") as file:\n",
        "#     file.write(htmlData)\n",
        "\n",
        "display(HTML(df.to_html(index=False)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Title</th>\n",
              "      <th>Published</th>\n",
              "      <th>Reference</th>\n",
              "      <th>Match</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
              "      <td>2020-05-21 00:00:00</td>\n",
              "      <td>https://doi.org/10.1002/cpt.1910</td>\n",
              "      <td>Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Work-related and Personal Factors Associated with Mental Well-being during COVID-19 Response: A Survey of Health Care and Other Workers</td>\n",
              "      <td>2020-06-11 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.06.09.20126722v1?rss=1</td>\n",
              "      <td>Poor family supportive behaviors by supervisors were also associated with these outcomes [1.40 (1.21 - 1.62), 1.69 (1.48 - 1.92), 1.54 (1.44 - 1.64)].</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>No evidence that androgen regulation of pulmonary TMPRSS2 explains sex-discordant COVID-19 outcomes</td>\n",
              "      <td>2020-04-21 00:00:00</td>\n",
              "      <td>https://doi.org/10.1101/2020.04.21.051201</td>\n",
              "      <td>In addition to male sex, smoking is a risk factor for COVID-19 susceptibility and poor clinical outcomes .</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Current status of potential therapeutic candidates for the COVID-19 crisis</td>\n",
              "      <td>2020-04-22 00:00:00</td>\n",
              "      <td>https://doi.org/10.1016/j.bbi.2020.04.046</td>\n",
              "      <td>There was no difference on 28-day mortality between heparin users and nonusers.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
              "      <td>2020-03-15 00:00:00</td>\n",
              "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
              "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XSf68I-ZfXOG"
      },
      "source": [
        "# Extracting additional columns from query results\n",
        "\n",
        "The example above uses the Embeddings index to find the top 5 best matches. In addition to this, an Extractor instance (this will be explained further in part 5) is used to ask additional questions over the search results, creating a richer query response."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TLVOTQJchvTi"
      },
      "source": [
        "%%capture\n",
        "from txtai.pipeline import Extractor\n",
        "\n",
        "# Create extractor instance using qa model designed for the CORD-19 dataset\n",
        "extractor = Extractor(embeddings, \"NeuML/bert-small-cord19qa\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "19fmKawThs6d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 293
        },
        "outputId": "b7cd40e3-a87c-419d-f520-b7795607cebc"
      },
      "source": [
        "db = sqlite3.connect(\"articles.sqlite\")\n",
        "cur = db.cursor()\n",
        "\n",
        "results = []\n",
        "for uid, score in embeddings.search(\"risk factors\", 5):\n",
        "  cur.execute(\"SELECT article, text FROM sections WHERE id = ?\", [uid])\n",
        "  uid, text = cur.fetchone()\n",
        "\n",
        "  # Get list of document text sections to use for the context\n",
        "  cur.execute(\"SELECT Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND article = ? ORDER BY Id\", [uid])\n",
        "  texts = []\n",
        "  for name, txt in cur.fetchall():\n",
        "    if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
        "      texts.append(txt)\n",
        "\n",
        "  cur.execute(\"SELECT Title, Published, Reference from articles where id = ?\", [uid])\n",
        "  article = cur.fetchone()\n",
        "\n",
        "  # Use QA extractor to derive additional columns\n",
        "  answers = extractor([(\"Risk Factors\", \"risk factors\", \"What risk factors?\", False),\n",
        "                       (\"Locations\", \"hospital country\", \"What locations?\", False)], texts)\n",
        "\n",
        "  results.append(article + (text,) + tuple([answer[1] for answer in answers]))\n",
        "\n",
        "# Free database resources\n",
        "db.close()\n",
        "\n",
        "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\", \"Risk Factors\", \"Locations\"])\n",
        "display(HTML(df.to_html(index=False)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Title</th>\n",
              "      <th>Published</th>\n",
              "      <th>Reference</th>\n",
              "      <th>Match</th>\n",
              "      <th>Risk Factors</th>\n",
              "      <th>Locations</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
              "      <td>2020-05-21 00:00:00</td>\n",
              "      <td>https://doi.org/10.1002/cpt.1910</td>\n",
              "      <td>Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .</td>\n",
              "      <td>sex, obesity, genetic factors and mechanical factors</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Work-related and Personal Factors Associated with Mental Well-being during COVID-19 Response: A Survey of Health Care and Other Workers</td>\n",
              "      <td>2020-06-11 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.06.09.20126722v1?rss=1</td>\n",
              "      <td>Poor family supportive behaviors by supervisors were also associated with these outcomes [1.40 (1.21 - 1.62), 1.69 (1.48 - 1.92), 1.54 (1.44 - 1.64)].</td>\n",
              "      <td>Poor family supportive behaviors</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>No evidence that androgen regulation of pulmonary TMPRSS2 explains sex-discordant COVID-19 outcomes</td>\n",
              "      <td>2020-04-21 00:00:00</td>\n",
              "      <td>https://doi.org/10.1101/2020.04.21.051201</td>\n",
              "      <td>In addition to male sex, smoking is a risk factor for COVID-19 susceptibility and poor clinical outcomes .</td>\n",
              "      <td>Higher morbidity and mortality</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Current status of potential therapeutic candidates for the COVID-19 crisis</td>\n",
              "      <td>2020-04-22 00:00:00</td>\n",
              "      <td>https://doi.org/10.1016/j.bbi.2020.04.046</td>\n",
              "      <td>There was no difference on 28-day mortality between heparin users and nonusers.</td>\n",
              "      <td>elicited strong inflammatory responses are favorable or detrimental</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
              "      <td>2020-03-15 00:00:00</td>\n",
              "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
              "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
              "      <td>sex (male), age (≥60), and severe pneumonia</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ColTLy--rWfR"
      },
      "source": [
        "In the example above, the Embeddings index is used to find the top N results for a given query. On top of that, a question-answer extractor is used to derive additional columns based on a list of questions. In this case, the \"Risk Factors\" and \"Location\" columns were pulled from the document text."
      ]
    }
  ]
}